# Lightweight VLM

Smolvlm 500M Anime Caption V0.2

A vision-language model specialized in describing anime-style images, fine-tuned from SmolVLM-500M-Base.

Image-to-Text English

Smolvlm 500M Anime Caption V0.1

A vision-language model specialized in describing anime-style images, fine-tuned from SmolVLM-500M-Base, trained on 180K synthetic image/caption pairs generated by large language models.

Image-to-Text English

Granite Vision 3.2 2b

granite-vision-3.2-2b is a compact and efficient vision-language model specifically designed for visual document understanding, capable of automatically extracting content from tables, charts, infographics, and more.

Transformers English

Paligemma 3b Ft Science Qa 448

PaliGemma is a lightweight 3B-parameter vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model. It takes image and text inputs and generates text outputs.

Paligemma 3b Pt 448

PaliGemma is a lightweight and versatile vision-language model built on the SigLIP vision model and Gemma language model, supporting multilingual image-text interaction tasks.

Paligemma 3b Pt 896

PaliGemma is a versatile lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs. It has multilingual capabilities.

Paligemma 3b Mix 448

PaliGemma is a versatile lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs to generate text outputs.

Paligemma 3b Ft Docvqa 896

PaliGemma is a lightweight vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model, supporting multilingual image-text understanding and generation.

Paligemma 3b Ft Refcoco Seg 896

PaliGemma is a lightweight vision-language model developed by Google, built upon the SigLIP vision model and Gemma language model, supporting multilingual text generation and visual understanding tasks.

Paligemma 3b Mix 224

PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs with text outputs.

Paligemma 3b Pt 224

PaliGemma is a versatile lightweight vision-language model (VLM) built upon the SigLIP vision model and the Gemma language model, capable of processing both image and text inputs to generate text outputs.

Paligemma 3b Ft Vqav2 448

PaliGemma is a lightweight vision-language model developed by Google, combining image understanding and text generation capabilities, supporting multilingual tasks.

Paligemma 3b Ft Ocrvqa 448

PaliGemma is a versatile lightweight vision-language model (VLM) developed by Google, built on the SigLIP vision model and Gemma language model, supporting both image and text inputs with text outputs.
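
The PaliGemma entries above differ mainly in their name suffixes. As a minimal sketch, assuming the naming convention from Google's PaliGemma release (`pt` for pre-trained, `mix` for fine-tuned on a mixture of tasks, `ft` for fine-tuned on a single task, and a trailing number for the input image resolution in pixels), the names as listed here can be split into fields. The function `parse_paligemma_name` and the `PaliGemmaName` type are hypothetical helpers for illustration, not part of any library.

```python
import re
from typing import NamedTuple, Optional


class PaliGemmaName(NamedTuple):
    """Fields recovered from a listing name (hypothetical helper type)."""
    size: str             # parameter count as written, e.g. "3b"
    variant: str          # "pt", "mix", or "ft" (assumed convention)
    task: Optional[str]   # words between variant and resolution, if any
    resolution: int       # trailing number, assumed to be input size in pixels


# Matches names in the form used by this listing, e.g.
# "Paligemma 3b Ft Docvqa 896" or "Paligemma 3b Pt 224".
_NAME_PATTERN = re.compile(
    r"^paligemma\s+(?P<size>\d+b)\s+(?P<variant>pt|mix|ft)"
    r"(?:\s+(?P<task>.+?))?\s+(?P<res>\d+)$",
    re.IGNORECASE,
)


def parse_paligemma_name(name: str) -> PaliGemmaName:
    """Split a listing name into size, variant, task, and resolution."""
    match = _NAME_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised PaliGemma listing name: {name!r}")
    task = match.group("task")
    return PaliGemmaName(
        size=match.group("size").lower(),
        variant=match.group("variant").lower(),
        task=task.lower() if task else None,
        resolution=int(match.group("res")),
    )


if __name__ == "__main__":
    for listing_name in ("Paligemma 3b Pt 448", "Paligemma 3b Ft Docvqa 896"):
        print(parse_paligemma_name(listing_name))
```

Grouping the parsed entries by `variant` and `resolution` makes it easier to compare, for example, the three `pt` checkpoints, which differ only in input resolution.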

© 2025 AIbase